feat(sqlite): add streaming SqliteSidecarBuilder keyed by a column #24

Merged
anoop-narang merged 3 commits into main from feat/sqlite-sidecar-stream-builder on May 29, 2026

@anoop-narang
Collaborator

What

Adds SqliteSidecarBuilder — an incremental begin() → push_batch() → finish() builder for the SQLite point-lookup sidecar, fed from a RecordBatch stream instead of parquet files on disk.

Unlike SqliteLookupProvider::open_or_build, which opens parquet files itself and synthesises monotonic 0..N keys, this builder takes batches one at a time and reads each row's key from a designated column, e.g. a storage engine's native rowid. That lets a caller drive a single pass over a source (fanning each batch out to both a USearch index and this sidecar) without first materialising an intermediate parquet file.
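
To illustrate the single-pass fan-out described above, here is a minimal std-only sketch. Everything in it is hypothetical: `ToyBatch`, `KeyedSink`, `ToyIndex`, `ToySidecar` and `fan_out` are stand-ins invented for this example, not the crate's, Arrow's or USearch's API. It only shows the shape of the idea: each batch is read once, and every sink takes its keys from the same designated column.

```rust
/// Toy stand-in for an Arrow RecordBatch: column-major i64 data.
struct ToyBatch {
    columns: Vec<Vec<i64>>,
}

/// A consumer that takes one batch at a time, keyed by a designated column.
trait KeyedSink {
    fn push_batch(&mut self, batch: &ToyBatch, key_col: usize);
}

/// Stand-in for the vector index: remembers only the keys it was given.
#[derive(Default)]
struct ToyIndex {
    keys: Vec<i64>,
}

/// Stand-in for the sidecar: remembers (key, value) pairs.
struct ToySidecar {
    value_col: usize,
    rows: Vec<(i64, i64)>,
}

impl KeyedSink for ToyIndex {
    fn push_batch(&mut self, batch: &ToyBatch, key_col: usize) {
        self.keys.extend_from_slice(&batch.columns[key_col]);
    }
}

impl KeyedSink for ToySidecar {
    fn push_batch(&mut self, batch: &ToyBatch, key_col: usize) {
        let keys = &batch.columns[key_col];
        let values = &batch.columns[self.value_col];
        self.rows
            .extend(keys.iter().copied().zip(values.iter().copied()));
    }
}

/// Drive a single pass over the source, handing each batch to every sink.
/// No batch is retained after the sinks have seen it.
fn fan_out<I>(batches: I, key_col: usize, sinks: &mut [&mut dyn KeyedSink])
where
    I: IntoIterator<Item = ToyBatch>,
{
    for batch in batches {
        for sink in sinks.iter_mut() {
            sink.push_batch(&batch, key_col);
        }
    }
}
```

Because the keys come from a column rather than a running counter, they can be sparse (for example 7, 42, 1000) and both sinks still agree on them.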

Why

A consumer building a vector index over a non-parquet source (a DuckLake snapshot, addressed by its native rowid) has no single parquet file to hand to open_or_build, and its rows aren't keyed by a dense 0..N ordinal. Rather than make the consumer stage the whole snapshot to a temp parquet (an extra full serialize + re-decode), this entrypoint consumes the batch stream the consumer already has.

Details

  • Bounded memory: one transaction wraps the build; push_batch inserts a batch's rows and returns, accumulating nothing. Dropping the builder before finish() rolls the transaction back, so a half-built table is never persisted.
  • Key column stays INTEGER PRIMARY KEY (the rowid-alias B-tree) — point-lookup performance is unchanged; sparse/non-monotonic keys are fine.
  • Shares the CREATE/INSERT DDL and row-param construction with the parquet build_table via new ddl() / row_to_params() helpers; build_table behaviour is unchanged.
  • USearch side needs no change (already key-agnostic), so this is the only addition.
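
The begin() → push_batch() → finish() lifecycle and the rollback-on-drop behaviour can be modelled with a small std-only sketch. This is a toy, not the PR's implementation: `ToyStore` and `ToyBuilder` are hypothetical names, and a `BTreeMap` stands in for the open SQLite transaction. Unlike the real builder, which writes through to SQLite and so keeps memory bounded, this model buffers pending rows in memory purely to make the commit and rollback semantics visible.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for the SQLite file: only committed rows are visible here.
#[derive(Default)]
struct ToyStore {
    rows: BTreeMap<i64, String>,
    rollbacks: usize,
}

/// Toy builder; `pending` plays the role of the open transaction.
struct ToyBuilder<'a> {
    store: &'a mut ToyStore,
    pending: BTreeMap<i64, String>,
    committed: bool,
}

impl<'a> ToyBuilder<'a> {
    /// Open the "transaction".
    fn begin(store: &'a mut ToyStore) -> Self {
        ToyBuilder {
            store,
            pending: BTreeMap::new(),
            committed: false,
        }
    }

    /// Insert one batch of (key, value) rows; keys may be sparse.
    fn push_batch(&mut self, batch: &[(i64, &str)]) {
        for &(key, value) in batch {
            self.pending.insert(key, value.to_string());
        }
    }

    /// Commit: publish everything pushed so far and return the row count.
    fn finish(mut self) -> usize {
        let pending = std::mem::take(&mut self.pending);
        let n = pending.len();
        self.store.rows.extend(pending);
        self.committed = true;
        n
    }
}

impl Drop for ToyBuilder<'_> {
    /// Dropping without `finish()` discards the pending rows (the rollback).
    fn drop(&mut self) {
        if !self.committed {
            self.pending.clear();
            self.store.rollbacks += 1;
        }
    }
}
```

The design point this mirrors is that commit is an explicit, consuming call, while abandonment needs no call at all: letting the builder go out of scope is enough to leave the store untouched.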

Tests

4 new tests (sparse-rowid round-trip, abandon-on-drop rollback, UInt64 keys, validation errors) + the existing 7 sqlite_provider tests; full suite (--all-features) green, fmt + clippy --all-targets -D warnings clean.

open_or_build opens parquet files and synthesises monotonic 0..N keys.
Add SqliteSidecarBuilder: an incremental begin/push_batch/finish builder
that consumes RecordBatches and reads each row's key from a designated
column, so a caller can build the sidecar from any batch source (e.g. a
storage engine's native rowid) in a single bounded-memory pass without
materialising an intermediate parquet file.

Shares CREATE/INSERT DDL and row-param construction with the parquet
path via new ddl()/row_to_params() helpers; build_table behaviour is
unchanged. One transaction wraps the build; an abandoned builder rolls
back on drop. Keys are stored as INTEGER PRIMARY KEY, preserving the
B-tree point-lookup performance.
Comment thread src/sqlite_provider.rs
Comment on lines +362 to +368
    let ncols = batch.num_columns();
    if self.key_col_index >= ncols {
        return Err(DataFusionError::Execution(format!(
            "key_col_index {} out of range for batch with {ncols} columns",
            self.key_col_index
        )));
    }

nit: key_col_index is bounds-checked here but value_col_indices is not — if any entry is >= batch.num_columns(), row_to_params will panic on batch.column(ci) instead of returning a DataFusionError. Worth validating both for consistency so a bad caller index always surfaces as a clean error. (not blocking)

claude[bot] previously approved these changes on May 29, 2026
push_batch validated key_col_index but not value_col_indices; an
out-of-range entry would panic in row_to_params on batch.column(ci)
rather than returning a clean DataFusionError. Validate both, and add a
test for the out-of-range value-index case. Addresses PR review nit.
@anoop-narang
Collaborator Author

Addressed the nit in a289ae4: push_batch now bounds-checks value_col_indices too, returning a clean DataFusionError instead of panicking in row_to_params on an out-of-range index. Added a test covering the out-of-range value-index case.
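
For readers following along, the shape of the combined check might look roughly like the std-only sketch below. It is an illustration, not the code in a289ae4: `validate_indices` is a hypothetical helper, the error is a plain `String` rather than `DataFusionError`, and the wording of the value-index message is an assumption (only the key-index message appears in the review thread).

```rust
/// Check every caller-supplied column index against the batch width,
/// returning an error instead of letting a later column access panic.
fn validate_indices(
    ncols: usize,
    key_col_index: usize,
    value_col_indices: &[usize],
) -> Result<(), String> {
    if key_col_index >= ncols {
        return Err(format!(
            "key_col_index {key_col_index} out of range for batch with {ncols} columns"
        ));
    }
    // Report the first offending value index, if any.
    if let Some(&bad) = value_col_indices.iter().find(|&&ci| ci >= ncols) {
        return Err(format!(
            "value_col_indices entry {bad} out of range for batch with {ncols} columns"
        ));
    }
    Ok(())
}
```

Running both checks up front means a bad caller index surfaces the same way regardless of which list it came from.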

@anoop-narang merged commit 87e2b73 into main on May 29, 2026
6 checks passed
@anoop-narang deleted the feat/sqlite-sidecar-stream-builder branch on May 29, 2026 at 12:35